Range coder

Dario Phong/Hugi

Introduction

The range coder is presented here, a variation of arithmetic coding which does renormalization in bytes instead of bits, thus running twice faster and with 0.01% worse compression than a standard implementation of arithmetic coding. The range coder is believed to be patent free.

Some experience with models and arithmetic coding is needed. In this case I recommend the quasi-static model.

Entropy and how to achieve it

The entropy is defined as -Log2(P) where Log2 is the logarithm with the base 2 and P is the probability (ranging from 0 to 1) of a given symbol. Huffman in most of the cases does not achieve the entropy, unless the probability is 2^x, in this case it performs: -Log2(4)=2 or -Log2(32)=5. But this is not what happens in most of the cases. In such cases Huffman assigns less or more bits than it should. Just an example: If the probability of a given symbol is 7/8 = 0.875, the entropy of it is: -Log2(0.875)= 0.192, but in this case Huffman will assign it 1 bit, which is by far more than it should.

But we have the arithmetic coder which more or less achieves the entropy (it actually loses some compression due to the non-infinite precision, scaled probabilities and end of data symbol). Arithmetic coding works doing an interval between 0-1 and further subdiving it, based on the cumulative probability of the symbol which is currently being processed. Now let's see the routines which the range coder uses. For adjusting the interval it uses the following code:

- r = range / cumulative_probability[last_symbol] (Divide range by the maximum cumulative probability)

- tmp = r * cumulative_probability[current_symbol]

- range = r * probability[current_symbol]

- low += tmp

Where range is initialized to 1 and low to 0. These operations should be done with infinite floating point precision. When decoding, first we have to see what symbol there is, and then update the interval.

- help = range / cumulative_probability[last_symbol]

- tmp = low / help (Now tmp holds the cumulative probability of the current symbol)

- Once we know the symbol, we can update the interval for decoding the next one.

- tmp = help * cumulative_probability[current_symbol]

- low -= tmp

- range = help * probability[current_symbol]

Symbol a b c
Probability 2 1 1
Cumulative prob. 0 2 3

Let's say we have to encode the following symbols "bab" and we have the probabilities in the table above. The encoding and decoding will look like the following.

Encode a symbol:

- r = range / cumulative_probability[last_symbol]

- tmp = r * cumulative_probability[current_symbol]

- range = r * probability[current_symbol]

- low += tmp

Action Variable Operation Result
Init range (none) 1
Init low (none) 0
Encode b r 1/4 0.25
Encode b tmp 0.25*2 0.5
Encode b range 0.25*1 0.25
Encode b low 0+0.5 0.5
Encode a r 0.25/4 0.0625
Encode a tmp 0.0625*0 0
Encode a range 0.0625*2 0.125
Encode a low 0.5+0 0.5

Decode and update:

- help = range / cumulative_probability[last_symbol]

- tmp = low / help

- (Note that based on the cumulative probability, we find the current symbol, and thus we can know its probability.)

- tmp = help * cumulative_probability[current_symbol]

- low -= tmp

- range = help * probability[current_symbol]

Action Variable Operation Result
Init range (none) 1
Init low (none) 0.5
Decode help 1/4 0.25
Output b tmp 0.5/0.25 2
Update tmp 0.25*1 0.25
Update low 0.5-0.25 0.25
Update range 1-0.25 0.75
Decode help 0.75/4 0.1875
Output a tmp 0.25/0.1875 1.3333
Update tmp 0.1875*0 0
Update low 0.25-0 0.25
Update range 0.1875*2 0.375

More or less what we have done, is represented by the pic at the left. Note how the interval is first adjusted to represent 'b', and the new interval is again divided by the number of probabilities, and the range of 'a' is chosen, again, the same is done for 'b'. (In case we have to encode b.)

Encoder implementation

The encoder is made of four routines: init, renormalize, encode a symbol and flush encoder. Init must be called before starting to encode symbols, both encode and renormalize code the symbols, and flushing is done when you have encoded all the symbols. The next section discusses the pseudo code.

Encoder Init:

- Low = 0 (Both low and range keep track of the number)

- Range = Top_value (Top_value is a defined macro)

- Buffer = 0 (This is the first byte to output)

- Help = 0 (No underflow bytes)

- Bytecount = 0 (No bytes output so far)

Variables:

- Low and range. They have the floating point number.

- Buffer. This is the byte which we have to output. The first time we output a byte which will not have any information of the coding, but which is needed to avoid an "if" to see if we don't have changed the byte yet (in the renormalization).

- Help. This variable keeps track of the underflow bytes.

- Bytecount. It keeps track of the number of bytes that the encoder has output. Only needed for error recovery or if you want to do an archiver.

Encoder renormalize:

- While ( range <= Bottom_value )( As long as it's not above than Bottom_value, which means we can still output)

- if ( low < ( 0FFh << Shift_bits ) ) (If higher part is 0FFh a carry will occur, which we can't afford)

- Yes (We had a range with a potential carry (which we can't capture with our precision))

- Output_byte ( buffer ) (Output the byte)

- For ( ; help != 0 ; --help ) (Output all underflow bytes pending for output)

- Output_byte ( 0FFh )

- Buffer = low >> Shift_bits (Assign to the byte to output the high bytes of low)

- No (Now check for actual carry)

- if ( ( low And Top_value) != 0) (Check carry bit)

- Yes

- Output_byte ( buffer + 1 ) (The +1 comes from the carry which propagates till the byte saved)

- For ( ; help != 0 ; --help ) (Output all underflow bytes pending for output)

- Output_byte ( 00h ) (0FFh becomes 0 because of the carry!)

- Buffer = low >> Shift_bits (Assign to the byte to output the high bytes of low)

- No

- ++Buffer

- Range << 8 (Shift out byte which we have already output or are going to output, in case of 0FFh)

- Low << 8

- Low And ( Top_value - 1 ) (Clear carry bit)

- Repeat

Explanation of predefined values like Bottom_value comes at the end of this section. Encoder encode symbol:

- Renormalize( ); (Before encoding, renormalize)

- R = range / tot_f (New range for all the symbols)

- Tmp = r * lt_f (Start of the range of the current symbol)

- If ( lt_f + sy_f < tot_f ) (Last frequency?)

- Yes (For all the symbols but the last)

- Range = r * sy_f (Assign to range the range of the current range)

- No (For the last symbol)

- Range -= tmp

- Low += tmp (Update low)

Variables:

- Tot_f. This is the maximum cumulative probability.

- Lt_f. This is the cumulative probability of the symbol before the one to encode (low part of range, in a simple order-0 model this should be cump[symbol]).

- Sy_f. How big the range of our symbol is. In a simple order-0 model, it will be: freq[symbol+1]-freq[symbol]

Encoder flush:

- Renormalize( ); (Renormalize state before flushing)

- Temp = low >> 23 (Output two bytes more, and Buffer)

- If ( temp > 0FFh ) (Check for overflow)

- Output_byte ( Buffer+1 ) (There was overflow)

- For ( ; help != 0 ; --help ) (Output all underflow bytes pending for output)

- Output_byte ( 00h ) (0FFh becomes 0 because of the carry!)

- Else

- Output_byte ( Buffer ) (No overflow)

- For ( ; help != 0 ; --help )

- Output_byte ( 0FFh )

- Output_byte ( temp & 0FFh) (Output Low)

- Output_byte ( (temp = low >> (23-8)) & 0FFh)

Predefined values:

- Top_value. 80000000h (hex).

- Bottom_value. 00800000h.

- Shift_bits. 23.

- Extra_bits. 7.

Decoder implementation

The decoding is made of five routines: init, renormalize, decode symbol, update state and flush. Update state must be called after knowing the symbol's range, so you can extract it from the number.

Decoder init:

- C = inbyte(); (The first byte, a dummy one)

- Buffer = inbyte(); (Get first encoded byte)

- Low = buffer >> ( 8 - Extra_bits ) (Init low)

- Range = 1 << Extra_bits (Init range)

Decoder renormalize:

- While ( range <= Bottom_value ) (Read the bytes that the encode output)

- Low = ( low << 8 ) Or ( ( buffer << Extra_bits ) And 0FFh ) (Put bits in low)

- Buffer = inbyte( );

- Low = Low Or ( buffer >> ( 8 - Extra_bits) )

- Repeat

Decoder decode symbol:

- Renormalize( ); (Remember to use the decoder version in the decompressor)

- Help = range / tot_f (The range for all the symbols. Note that help this time doesn't keep track of the underflow bytes, but rather it helps with decoding.)

- Tmp = low / help (Get the range of the current symbol)

- If ( tmp > tot_f )

- Yes

- Return tot_f (The range is the maximum possible)

- No

- Return tmp (The cumulative probability)

Decode update state:

- Tmp = help * lt_f

- low -= tmp (Extract number from low)

- If ( lt_f + sy_f < tot_f ) (Check to see if that is symbol with highest frequency)

- Yes

- Range = help * sy_f (Extract number from range)

- No

- Range -= tmp

Decoder flush:

Renormalize ( ); (Read the bytes that the encoder wrote, in case it did)

Implementation

Low holds the floating point number. In the theory part we use floating point variables. For implementations we use integer math, and instead of 0-1 range we use 0-FFFFFFFFh (dependent on the implementation), which is a 32 bit value. The bit 31 (at the left) is reserved to detect carry (when adding tmp). In bits 30-22 we perform renormalization. The rest of the bits from 21-0 give the high precision. It is estimated that you can code 2^23 symbols without any problems.

When the range is below 800000h we know that the renormalization part of low will never change, so we can output it. In order not to suffer underflow, when the renormalization part of low (bits 30-22) is 0FFh we extract them and keep a counter of how many underflow bytes we have discarded. Once we can output the renormalization part, we can output all those bytes too, but in case that carry has occurred, instead of 0FFh we have to output 00h. (Obviously if we have a value like 1FFFFh and we have carry, we must output 20000h instead, you see?)

If you are using a low level language you may want to check directly the carry, without using bit 31 for checking it, thus incrementing precision. If you are interested in such a implementation contact Michael Schindler, who is able to tell you how to implement it.

About the byte based IO (input/output). If you use a high level language, you can skip this paragraph, as the compiler will do this itself. When doing IO you can't directly output a byte as soon as it's available. That's very slow. Instead you keep a buffer and when it's full or when you have to flush it (end of compression) you write it to the output file. This has a great impact on speed. Time is reported in machine clocks. The following report comes from a file paper, from the Calgary Corpus set.

Buffer size (in bytes) Clocks
1 476679208
2 52550010
4 32170836
8 16429902
16 11381554
32 7301852
64 4177316
128 3673010
256 3259018
512 3121538
1024 3117984
2048 3018602
4096 3010252
8192 3018116

Models

In this article I've only discussed the routines to encode, now you need a model which gives the probabilities (in form of cumulative probabilities). You can use a simple static order-0 model, as presented for arithmetic coding. Or you can use a Quasi Static model, which is a good decision. You may want to use the data structures presented by Peter Fenwick in his paper "A New Data Structure for Cumulative Probability Tables: an Improved Frequency-to-Symbol Algorithm." The range coder can be used with ppm just as arithmetic coding is used, there's no difference in its use.

References

The range coder package. Available here.

Michael Schindler, private communication.


Dario Phong/Hugi